TCRconv: predicting recognition between T cell receptors and epitopes using contextualized motifs

Abstract

Motivation: T cells use T cell receptors (TCRs) to recognize small parts of antigens, called epitopes, presented by major histocompatibility complexes. Once an epitope is recognized, an immune response is initiated and T cell activation and proliferation by clonal expansion begin. Clonal populations of T cells with identical TCRs can remain in the body for years, thus forming immunological memory and potentially mappable immunological signatures, which could have implications in clinical applications including infectious diseases, autoimmunity and tumor immunology.

Results: We introduce TCRconv, a deep learning model for predicting recognition between TCRs and epitopes. TCRconv uses a deep protein language model and convolutions to extract contextualized motifs and provides state-of-the-art TCR-epitope prediction accuracy. Using TCR repertoires from COVID-19 patients, we demonstrate that TCRconv can provide insight into T cell dynamics and phenotypes during the disease.

Availability and implementation: TCRconv is available at https://github.com/emmijokinen/tcrconv.

Supplementary information: Supplementary data are available at Bioinformatics online.

Supplementary Fig. S1. TCR cross-reactivity in datasets (a) VDJdbβ-small, (b) VDJdbβ-large, and (c) VDJdbαβ-large. Each row of a heatmap represents the TCRs specific to the corresponding epitope and shows the fraction of them recognizing each of the epitopes present in the dataset. The bar plots on the right side of each heatmap show the average number of epitope specificities per TCR recognizing the epitope on the corresponding row. For example, TCRs specific to the EBV epitope EBNA3A RLRAEAQVK recognize on average 2.2 different epitopes in (b) VDJdbβ-large and 2.0 in (c) VDJdbαβ-large. TCRs recognizing certain epitopes show notable cross-reactivity. To highlight them, we have marked DENV epitopes in pink, EBV epitopes in blue, and two HIV-1 epitopes (HIV-1 KRWIILGLNK and HIV-1 KRWIIMGLNK) in green.
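The two quantities in the cross-reactivity caption (the pairwise fraction shown in the heatmap and the average number of specificities shown in the bar plot) can be sketched as follows. This is a minimal illustration, assuming the data are available as (TCR, epitope) pairs; the function name `cross_reactivity` and the toy CDR3 strings are invented for the example and are not taken from the datasets.

```python
from collections import defaultdict

def cross_reactivity(pairs):
    """pairs: iterable of (tcr, epitope) tuples.

    Returns (fraction, mean_specificities) where
      fraction[e1][e2] is the share of TCRs specific to e1 that also recognize e2,
      mean_specificities[e] is the average number of epitopes recognized
      by the TCRs specific to e.
    """
    epitopes_by_tcr = defaultdict(set)
    for tcr, epitope in pairs:
        epitopes_by_tcr[tcr].add(epitope)

    tcrs_by_epitope = defaultdict(set)
    for tcr, epitopes in epitopes_by_tcr.items():
        for epitope in epitopes:
            tcrs_by_epitope[epitope].add(tcr)

    fraction, mean_specificities = {}, {}
    for e1, tcrs in tcrs_by_epitope.items():
        # One heatmap row: overlap of e1's TCRs with every epitope's TCRs.
        fraction[e1] = {
            e2: len(tcrs & tcrs_by_epitope[e2]) / len(tcrs)
            for e2 in tcrs_by_epitope
        }
        # One bar: mean number of epitope specificities per TCR on this row.
        mean_specificities[e1] = sum(len(epitopes_by_tcr[t]) for t in tcrs) / len(tcrs)
    return fraction, mean_specificities

# Toy input: the CDR3 strings are made up; the epitopes appear in the captions.
toy_pairs = [
    ("CASSLGQAYEQYF", "GLCTLVAML"),
    ("CASSLGQAYEQYF", "RLRAEAQVK"),
    ("CASSIRSSYEQYF", "GLCTLVAML"),
    ("CASSPGTGGNEQFF", "RLRAEAQVK"),
]
fraction, mean_specificities = cross_reactivity(toy_pairs)
```

In this toy input one of the two GLCTLVAML-specific TCRs also recognizes RLRAEAQVK, so the corresponding heatmap cell would be 0.5 and the bar for GLCTLVAML would be 1.5.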
Supplementary Fig. S2. Epitope-wise method comparison with respect to AUROC on the (a) VDJdbβ-small and (b) VDJdbβ-large datasets, and with respect to average precision (AP) on the (c) VDJdbβ-small and (d) VDJdbβ-large datasets. The results are sorted in increasing order of the TCRconv scores. To draw attention to the accuracies for epitopes with notably cross-reactive TCRs, such epitopes are highlighted as in Supplementary Fig. S1: DENV epitopes in pink, EBV epitopes in blue, and two HIV-1 epitopes (HIV-1 KRWIILGLNK and HIV-1 KRWIIMGLNK) in green.
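For readers unfamiliar with the two metrics, a minimal sketch of epitope-wise AUROC and AP is given below. It assumes one-vs-rest binary labels per epitope, and all label and score values are hypothetical. The helper functions are plain-Python illustrations (AP is computed without special handling of tied scores) and are not the evaluation code behind the figure.

```python
def auroc(labels, scores):
    """Probability that a positive example outscores a negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precision_sum = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits

# Hypothetical predictions for two epitopes: (binary labels, predicted scores).
predictions = {
    "GLCTLVAML": ([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]),
    "RLRAEAQVK": ([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.2]),
}
per_epitope = {
    epitope: (auroc(y, s), average_precision(y, s))
    for epitope, (y, s) in predictions.items()
}
# Order epitopes by increasing AUROC, mirroring the sorted layout of the figure.
ordered = sorted(per_epitope, key=lambda epitope: per_epitope[epitope][0])
```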

Supplementary Fig. S4. CDR3 edit distances on VDJdbβ-large from TCRs with a chosen specificity to all TCRs with the same specificity (red) or to all TCRs with another specificity (grey). The y-axis is on a log scale.
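The comparison of edit distances within and between specificities can be sketched as below. The caption does not name the metric, so this sketch assumes the standard Levenshtein distance (unit cost for insertion, deletion and substitution). The function names and the toy CDR3 strings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]

def distances_by_specificity(cdr3s_by_epitope, chosen):
    """Distances from each TCR of the chosen epitope to TCRs with the same
    specificity (excluding itself) and to TCRs with any other specificity."""
    same, other = [], []
    chosen_cdr3s = cdr3s_by_epitope[chosen]
    for i, query in enumerate(chosen_cdr3s):
        same += [edit_distance(query, c) for j, c in enumerate(chosen_cdr3s) if j != i]
        for epitope, cdr3s in cdr3s_by_epitope.items():
            if epitope != chosen:
                other += [edit_distance(query, c) for c in cdr3s]
    return same, other

# Toy data: the CDR3 strings are made up for illustration.
toy = {
    "GLCTLVAML": ["CASSLGQAYEQYF", "CASSLGQGYEQYF"],
    "LLWNGPMAV": ["CASSPGTGGNEQFF"],
}
same, other = distances_by_specificity(toy, "GLCTLVAML")
```

The two returned lists correspond to the red and grey distributions in the figure, here for a single chosen epitope.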
(A) TCRconv performance in terms of AUROC and AP scores when trained with 139 099 TCRs specific to 188 peptide groups from SARS-CoV-2. Mean scores are shown above both boxplots. Each circle represents the score for one peptide group, colored by the genomic region and numbered according to Supplementary Table S3. (B) TCRconv performance when trained with TCRs specific to the 20 best-performing peptide groups from SARS-CoV-2 combined with the VDJdbβ-large dataset; results for all 70 peptides (or peptide groups) are shown above and results for only the 20 SARS-CoV-2 peptides below. For SARS-CoV-2 peptides, the coloring and numbering are the same as in panel (A); other epitopes are white, and their numbering corresponds to Supplementary Table S1. (C) AUROC and AP scores from the model in (A), and the diversity of the TCRs specific to each peptide group, both shown by the peptides' genome location.

Each plot consists of a sequence logo and a heatmap for the CDR3 sequences with the most common length specific to an epitope. The height of a letter in a sequence logo corresponds to that amino acid's frequency at that position, and the background color of the letter shows the average saliency for the amino acid at that position. The heatmap shows the saliency values for each CDR3 sequence individually. The sequences are clustered by the similarity of their saliency values, as illustrated by the dendrogram on the left side of the heatmap.

Table S7).
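The inputs to a logo of the kind described in the caption (letter height from frequency, background color from average saliency) can be sketched as follows. This is a minimal illustration that assumes per-residue saliency values are already available for each CDR3; the function name `logo_inputs`, the sequences and the saliency numbers are all hypothetical, and the clustering behind the dendrogram is not covered.

```python
from collections import Counter, defaultdict

def logo_inputs(cdr3s, saliencies):
    """cdr3s: list of CDR3 strings; saliencies: one list of per-residue values
    per CDR3, with the same length as the sequence.

    Keeps only sequences of the most common length and returns, per position,
    the amino acid frequencies and the mean saliency of each amino acid.
    """
    length = Counter(len(s) for s in cdr3s).most_common(1)[0][0]
    kept = [(s, sal) for s, sal in zip(cdr3s, saliencies) if len(s) == length]

    frequencies, mean_saliency = [], []
    for pos in range(length):
        values_by_aa = defaultdict(list)
        for seq, sal in kept:
            values_by_aa[seq[pos]].append(sal[pos])
        # Letter height: share of kept sequences carrying this amino acid here.
        frequencies.append({aa: len(v) / len(kept) for aa, v in values_by_aa.items()})
        # Letter background: mean saliency of this amino acid at this position.
        mean_saliency.append({aa: sum(v) / len(v) for aa, v in values_by_aa.items()})
    return length, frequencies, mean_saliency

# Toy input with invented sequences and saliency values.
toy_cdr3s = ["CASF", "CAGF", "CASSF"]
toy_saliencies = [
    [0.1, 0.2, 0.9, 0.4],
    [0.1, 0.4, 0.5, 0.4],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
length, frequencies, mean_saliency = logo_inputs(toy_cdr3s, toy_saliencies)
```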
(A) Paired CDR3αβ sequences with the two most common lengths specific to the EBV epitope BMLF1 GLCTLVAML, the YFV epitope NS4B LLWNGPMAV, or the SARS-CoV-2 epitope Spike YLQPRTFLL. The height of a letter in a sequence logo corresponds to that amino acid's frequency at that position, and the background color of the letter shows the average saliency for the amino acid at that position. (B) Examples of paired TCRαβ sequences.

Supplementary Table S1. Three datasets of epitope-specific TCR data collected from VDJdb. The datasets contain epitope-specific TCRs for Cytomegalovirus (CMV), Dengue virus types 1, 2 and 3 (DENV1, DENV2, DENV3-4), Epstein-Barr virus (EBV), Hepatitis C virus (HCV), Human immunodeficiency virus type 1 (HIV-1), Influenza A virus (IAV), Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), and Yellow Fever virus (YFV), as well as human stromal antigen 2 (BST2), insulin like growth factor 2 mRNA binding protein 2 (IGF2BP2), melanoma antigen (MLANA), and transketolase (TKT). VDJdbβ-large and VDJdbβ-small were collected in January 2021 and VDJdbαβ-large in September 2021, which explains why some of the SARS-CoV-2 epitopes are present only in VDJdbαβ-large.

Supplementary Table S2. Method comparison. Mean AUROC and AP scores for TCRconv, TCRGP, TCRdist, SETE, DeepTCR and ERGO-II from stratified 10-fold cross-validation. The mean AUROC and AP scores are macro averages over all epitopes. Standard deviations are given over all folds and over all epitopes (Epit.), showing that for all methods the variation between folds is smaller than the variation between epitopes. For TCRconv we used protBERT embeddings for CDR3 + full context, meaning that the embedding is first computed for the complete TCR (as defined by the CDR3 and the V- and J-genes), but only the parts of the embedding corresponding to the CDR3 are used with the predictor. For TCRGP, DeepTCR and TCRdist, the results were computed with models using either only CDR3βs or additionally other components of the TCRβs.
For these methods, accuracies were higher when the additional components were used. All result figures present the more accurate version of each method. Notation: $N_e$ is the number of epitopes (21 in VDJdbβ-small, 51 in VDJdbβ-large), $N_f$ is the number of folds (10), $s_{e,f}$ is the mean score (AUROC or AP) for epitope $e$ in fold $f$, $\bar{s}_e$ is the mean score for epitope $e$ over all folds, $\bar{s}_f$ is the mean score for fold $f$ over all epitopes, and $\bar{s}$ is the mean score over all epitopes and folds.

Supplementary Table S4. Healthy control and ImmuneCODE repertoire data used in the analysis of T cell dynamics during COVID-19 (Fig. 2a). The controls consist of the first 72 TCR repertoires from healthy (CMV-) subjects in cohort 1 of the study of Emerson et al. that had over 250 000 TCRs, had the number of templates reported, and came from a subject known to be at least 18 years old (the age of the youngest subject in the ImmuneCODE data used here). From ImmuneCODE, 493 repertoires with over 250 000 TCRs and with "Days from diagnosis to sample" reported were selected from four separate datasets.
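The summary statistics described for Supplementary Table S2 (a macro average over epitopes, with one standard deviation over folds and one over epitopes) can be sketched as follows. The exact formulas are not recoverable from the caption, so this sketch assumes the population standard deviation of the fold means and of the epitope means. The function name `summarize` and the score values are hypothetical.

```python
from statistics import mean, pstdev

def summarize(scores):
    """scores[e][f] is the score (AUROC or AP) for epitope e in fold f."""
    n_folds = len(scores[0])
    # Mean score per epitope over all folds.
    epitope_means = [mean(row) for row in scores]
    # Mean score per fold over all epitopes.
    fold_means = [mean(row[f] for row in scores) for f in range(n_folds)]
    return {
        # Macro average: every epitope contributes equally.
        "mean": mean(epitope_means),
        # Spread between folds (assumed: population standard deviation).
        "std_folds": pstdev(fold_means),
        # Spread between epitopes (assumed: population standard deviation).
        "std_epitopes": pstdev(epitope_means),
    }

# Toy scores for 2 epitopes and 2 folds (invented numbers).
toy_scores = [
    [0.9, 0.7],
    [0.6, 0.8],
]
summary = summarize(toy_scores)
```

In this toy case both fold means are 0.75 while the epitope means are 0.8 and 0.7, so the spread between folds is smaller than the spread between epitopes, the same pattern the caption reports for all methods.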

Supplementary Table S7. Average position-wise saliency values for TCRs specific to each epitope in the VDJdbαβ-large dataset. Values are given separately for the α and β chains, both for the CDR3 region and for the complete TCR as defined by the V- and J-genes and the CDR3.